Intermediate Workshop of the PRIN 2022 project, 5-6 February 2026, Bologna
outline
some unsupervised learning methods take as input a dissimilarity matrix
intuition
2 continuous variables: add up by-variable (absolute value or squared) differences
2 continuous and 1 categorical variable
one might consider purple and blue closer than e.g. purple and yellow
Most commonly used dissimilarity (or distance) measures are based on by-variable differences that are then added together
When variables are correlated or associated, shared information is effectively counted multiple times
inflated dissimilarities may distort downstream unsupervised learning tasks
The Euclidean distance \(\longrightarrow\) shared information is over-counted
The Mahalanobis distance \(\longrightarrow\) shared information is not over-counted
association-based pairwise distance
Association-based for continuous: Mahalanobis distance
Let \({\bf X}_{con}\) be an \(n\times Q_{d}\) data matrix of \(n\) observations described by \(Q_{d}\) continuous variables, and let \(\bf S\) be the sample covariance matrix; the Mahalanobis distance matrix is
\[ {\bf D}_{mah} = \left[\operatorname{diag}({\bf G})\,{\bf 1}_{n}^{\sf T} + {\bf 1}_{n}\,\operatorname{diag}({\bf G})^{\sf T} - 2{\bf G}\right]^{\odot 1/2} \] where
\([\cdot]^{\odot 1/2}\) denotes the element-wise square root
\({\bf G}=({\bf C}{\bf X}_{con}){\bf S}^{-1}({\bf C}{\bf X}_{con})^{\sf T}\) is the Mahalanobis Gram matrix
\({\bf C}={\bf I}_{n}-\tfrac{1}{n}{\bf 1}_{n}{\bf 1}_{n}^{\sf T}\) is the centering operator
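As a sketch, the Gram-matrix identity above can be implemented directly with numpy (the function name `mahalanobis_distance_matrix` is illustrative):

```python
import numpy as np

def mahalanobis_distance_matrix(X):
    """Pairwise Mahalanobis distances via the Gram-matrix identity:
    D2 = diag(G) 1' + 1 diag(G)' - 2G, with G = (CX) S^{-1} (CX)'."""
    Xc = X - X.mean(axis=0)                 # CX: column-centred data
    S = np.cov(X, rowvar=False)             # sample covariance matrix
    G = Xc @ np.linalg.inv(S) @ Xc.T        # Mahalanobis Gram matrix
    g = np.diag(G)
    D2 = g[:, None] + g[None, :] - 2.0 * G  # squared distances
    return np.sqrt(np.clip(D2, 0.0, None))  # element-wise square root
```

The `clip` guards against tiny negative entries due to floating-point roundoff before taking the element-wise square root.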
Association-based for categorical: total variation distance (TVD)
The distance matrix \({\bf D}_{tvd}\) is defined using the so-called delta framework, a general way to define categorical data distances.
Let \({\bf X}_{cat}\) be \(n\times Q_{c}\) a data matrix of \(n\) observations described by \(Q_{c}\) categorical variables.
\[ {\bf D} = {\bf Z}{\Delta}{\bf Z}^{\sf T} = \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right] \]
where \({\bf Z}=[{\bf Z}_1,\ldots,{\bf Z}_{Q_c}]\) is the \(n\times Q^{*}\) super-indicator matrix, with \(Q^{*}=\sum_{j=1}^{Q_c} q_j\) the total number of categories
\({\Delta}_j\) is the category dissimilarity matrix for variable \(j\), i.e., the \(j\)th diagonal block of the block-diagonal matrix \({\Delta}\).
setting \({\Delta}_j\) determines the categorical distance measure of choice (independent- or association-based)
Association-based for categorical: total variation distance (TVD) (2)
Consider the empirical joint probability distributions stored in the off-diagonal blocks of \({\bf P}\):
\[ {\bf P} = \frac{1}{n} \begin{bmatrix} {\bf Z}_1^{\sf T}{\bf Z}_1 & {\bf Z}_1^{\sf T}{\bf Z}_2 & \cdots & {\bf Z}_1^{\sf T}{\bf Z}_{Q_c} \\ \vdots & \vdots & \ddots & \vdots \\ {\bf Z}_{Q_c}^{\sf T}{\bf Z}_1 & {\bf Z}_{Q_c}^{\sf T}{\bf Z}_2 & \cdots & {\bf Z}_{Q_c}^{\sf T}{\bf Z}_{Q_c} \end{bmatrix}. \]
We refer to the conditional probability distributions for each variable \(j\) given each variable \(i\) (\(i,j=1,\ldots,Q_c\), \(i\neq j\)), stored in the block matrix
\[ {\bf R} = {\bf P}_z^{-1}({\bf P} - {\bf P}_z), \]
where \({\bf P}_z = {\bf P} \odot {\bf I}_{Q^*}\), and \({\bf I}_{Q^*}\) is the \(Q^*\times Q^*\) identity matrix.
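A brief numpy sketch of \({\bf P}\) and \({\bf R}\) built from the per-variable indicator matrices (the function name `conditional_blocks` is illustrative):

```python
import numpy as np

def conditional_blocks(Z_list):
    """P: empirical joint probabilities; R = Pz^{-1}(P - Pz), whose
    off-diagonal blocks hold conditional distributions row-wise."""
    Z = np.hstack(Z_list)          # super-indicator matrix, n x Q*
    n = Z.shape[0]
    P = (Z.T @ Z) / n              # Q* x Q* joint probabilities
    m = np.diag(P)                 # marginal category probabilities
    R = (P - np.diag(m)) / m[:, None]
    return P, R
```

Each row of an off-diagonal block \({\bf R}_{ji}\) sums to one (a conditional distribution), while the diagonal blocks vanish.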
Association-based for categorical: total variation distance (TVD) (3)
Let \({\bf r}^{ji}_a\) and \({\bf r}^{ji}_b\) be rows \(a\) and \(b\) of \({\bf R}_{ji}\), the \((j,i)\)th off-diagonal block of \({\bf R}\). The category dissimilarity between \(a\) and \(b\) for variable \(j\) based on the total variation distance (TVD) is defined as
\[ \delta^{j}_{tvd}(a,b) = \sum_{i\neq j}^{Q_c} w_{ji} \Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b}) = \sum_{i\neq j}^{Q_c} w_{ji} \left[\frac{1}{2}\sum_{\ell=1}^{q_i} |{\bf r}^{ji}_{a\ell}-{\bf r}^{ji}_{b\ell}|\right], \label{ab_delta} \]
where \(w_{ji}=1/(Q_c-1)\) for equal weighting (can be user-defined).
The TVD-based dissimilarity matrix is therefore
\[ {\bf D}_{tvd}= {\bf Z}{\Delta}^{(tvd)}{\bf Z}^{\sf T}. \]
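Putting the pieces together, a hedged numpy sketch of \(\delta^{j}_{tvd}\) under equal weights (names `tvd_delta` and `sizes` are illustrative; `R` is the conditional block matrix defined earlier):

```python
import numpy as np

def tvd_delta(R, sizes, j):
    """Delta_j^{(tvd)}: average over i != j of the total variation
    distance between rows a, b of the conditional block R_ji."""
    starts = np.concatenate(([0], np.cumsum(sizes)))
    qj = sizes[j]
    others = [i for i in range(len(sizes)) if i != j]
    w = 1.0 / len(others)                  # equal weights w_ji
    delta = np.zeros((qj, qj))
    rows_j = slice(starts[j], starts[j + 1])
    for i in others:
        Rji = R[rows_j, starts[i]:starts[i + 1]]
        for a in range(qj):
            for b in range(qj):
                delta[a, b] += w * 0.5 * np.abs(Rji[a] - Rji[b]).sum()
    return delta
```

The full \({\bf D}_{tvd}\) then follows as \({\bf Z}\,\mathrm{blockdiag}(\Delta_1,\ldots,\Delta_{Q_c})\,{\bf Z}^{\sf T}\).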
A straightforward association-based (AB) distance for mixed data is given by the convex combination of Mahalanobis and TVD distances:
\[ {\bf D}_{mix} =\frac{Q_{d}}{Q}\,{\bf D}_{mah} +\left(1-\frac{Q_{d}}{Q}\right){\bf D}_{tvd}. \]
this distance only accounts for correlations or associations among variables of the same type
no continuous–categorical interactions are considered.
define \(\Delta_{int}\), which accounts for the interactions, and use it to augment \(\Delta^{(tvd)}\)
\[ {\bf D}_{mix}^{(int)} = {\bf D}_{mah} + {\bf D}_{cat}^{(int)}, \]
where
\[ {\bf D}_{cat}^{(int)}={\bf Z}\tilde{\Delta}{\bf Z}^\top \] and
\[ \tilde{\Delta} = (1-\alpha)\Delta^{(tvd)} + \alpha \Delta_{int} \] where \(\alpha=\frac{1}{Q_{c}}\).
What is \(\Delta^{int}\)?
consider \({\bf D}_{mah}\) and sort it to identify the neighbors for each observation.
set a proportion of neighbors to consider, say \(\hat{\pi}_{nn}=0.1\)
for each pair of categories \((a,b)\), \(a,b=1,\ldots,q_{j}\), \(a\neq b\) of the \(j^{th}\) categorical variable:
classify the observations using the prior-corrected decision rule
\[ \text{if $i$ is such that }\ \ \ \frac{\hat{\pi}_{nn}(a)}{\hat{\pi}(a)}\geq\frac{\hat{\pi}_{nn}(b)}{\hat{\pi}(b)} \ \ \ \text{ then assign $i$ to class $a$ else to class $b$} \]
compute \(\delta_{int}^{j}(a,b)\) as the balanced accuracy (average of class-wise sensitivities) \[ \delta_{int}^{j}(a,b)=\Phi_{int}^{j}(a,b)=\frac{1}{2}\left(\frac{\texttt{true } a}{\texttt{true } a + \texttt{false }a}+\frac{\texttt{true } b}{\texttt{true } b + \texttt{false }b}\right) \]
values near 1 indicate that categories \(a\) and \(b\) are well separated by the continuous variables; values near 0.5 indicate they are not
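A minimal sketch of this NN-based separability score, assuming the continuous distances are precomputed in `D` and the categorical labels in `y` (the function and argument names `phi_int`, `pi_nn` are illustrative):

```python
import numpy as np

def phi_int(D, y, a, b, pi_nn=0.1):
    """Balanced accuracy of the prior-corrected NN rule for
    categories a, b, given a continuous distance matrix D."""
    idx = np.where(np.isin(y, [a, b]))[0]
    k = max(1, int(round(pi_nn * len(idx))))   # neighbourhood size
    prior_a = np.mean(y[idx] == a)             # class priors among {a, b}
    prior_b = 1.0 - prior_a
    pred = np.empty(len(idx), dtype=y.dtype)
    for t, i in enumerate(idx):
        others = idx[idx != i]                 # exclude i itself
        nn = others[np.argsort(D[i, others])[:k]]
        p_a = np.mean(y[nn] == a)              # local proportion of a
        pred[t] = a if p_a / prior_a >= (1 - p_a) / prior_b else b
    true = y[idx]
    sens_a = np.mean(pred[true == a] == a)
    sens_b = np.mean(pred[true == b] == b)
    return 0.5 * (sens_a + sens_b)
```

With well-separated categories the score approaches 1; with indistinguishable categories it hovers around 0.5.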
for a general categorical variable \(j\) with \(q_{j}\) categories, these quantities are computed for all \(\frac{q_j(q_j -1)}{2}\) pairs, filling \(\Delta_{int}\) pair by pair; e.g., for \(q_j=4\):
\[ \Delta_{int} = \begin{pmatrix} 0 & 0.94 & 0.4 & 0.39 \\ 0.94 & 0 & 0.54 & 0.55 \\ 0.4 & 0.54 & 0 & 0 \\ 0.39 & 0.55 & 0 & 0 \end{pmatrix} \]
Spectral clustering: a graph partitioning problem
Graph representation
a graph representation of the data matrix \(\bf X\): the aim is then to cut it into \(K\) groups (clusters)
the affinity matrix \({\bf A}\)
the elements \(w_{ij}\) of \({\bf A}\) are high (low) if \(i\) and \(j\) are in the same (different) group
| . | a | b | c | d |
|---|---|---|---|---|
| a | 0 | 0 | \(w_{ac}\) | 0 |
| b | 0 | 0 | \(w_{bc}\) | \(w_{bd}\) |
| c | \(w_{ca}\) | \(w_{cb}\) | 0 | \(w_{cd}\) |
| d | 0 | \(w_{db}\) | \(w_{dc}\) | 0 |
Spectral clustering: making the graph easy to cut
An approximated solution to the graph partitioning problem:
\[{\bf L} = {\bf D}_{r}^{-1/2}\,\underbrace{\exp\left(-{\bf D}^{2}(2\sigma^{2})^{-1}\right)}_{\text{affinity matrix } {\bf A}}\,{\bf D}_{r}^{-1/2} = {\bf Q}{\Lambda}{\bf Q}^{\sf T}\]
\({\bf D}\) is the \(n\times n\) matrix of pairwise Euclidean distances
the \(\sigma\) parameter dictates the number of neighbors each observation is linked to (rule of thumb: median distance to the 20th nearest neighbor)
diagonal terms of \(\bf A\) are set to zero: \(a_{ii}=0\) , \(i=1,\ldots,n\)
\({\bf D}_{r}=\operatorname{diag}({\bf r})\), where \({\bf r}={\bf A}{\bf 1}\) and \({\bf 1}\) is an \(n\)-dimensional vector of ones
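The construction above can be sketched in numpy; row-normalising the top-\(K\) eigenvectors and running k-means on them are the standard follow-up steps (the function name `spectral_embedding` is illustrative):

```python
import numpy as np

def spectral_embedding(D, sigma, K):
    """A = exp(-D^2 / (2 sigma^2)) with zero diagonal,
    L = Dr^{-1/2} A Dr^{-1/2}; returns row-normalised top-K
    eigenvectors of L, ready for k-means."""
    A = np.exp(-D**2 / (2.0 * sigma**2))   # Gaussian affinity
    np.fill_diagonal(A, 0.0)               # a_ii = 0
    r = A.sum(axis=1)                      # degrees r = A 1
    s = 1.0 / np.sqrt(r)
    L = s[:, None] * A * s[None, :]        # Dr^{-1/2} A Dr^{-1/2}
    vals, vecs = np.linalg.eigh(L)         # ascending eigenvalues
    Q = vecs[:, -K:]                       # top-K eigenvectors
    return Q / np.linalg.norm(Q, axis=1, keepdims=True)
```

On well-separated data the rows of the embedding collapse onto (near-)orthogonal directions, one per cluster, which makes the subsequent k-means step easy.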
Spectral clustering: solution and performance
Recent benchmarking studies (Murugesan et al., 2021): SC works well, particularly in the case of non-convex and overlapping clusters
toy experiment
does \(\delta_{int}^{j}(a,b)\) help?
a toy scenario
a latent cluster membership
1 signal categorical variable partially associated with the latent clusters
4 categorical variables: 1 signal, 3 noise, mutually independent
2 continuous variables, unrelated to the latent clusters, but with discriminant power for some categories of the signal categorical variable
compared methods
Gower dissimilarity: a straightforward option
naive SC for mixed-type data (Mbuga and Tortora, 2021)
GUDMM (Mousavi and Sehhati, 2023): an entropy-based approach that takes into account inter- and intra-variable relations
unbiased association-based approach (van de Velden et al., 2025): association-based, but no interactions considered
toy experiment
Final considerations and future work
association-based measures aim to go beyond match/mismatch of categories
when the signal is limited to a few variables, retrieving information from continuous/categorical interactions may be useful
measuring interactions via a non-parametric, NN-based approach is suitable for non-convex/oddly shaped clusters
commensurability makes the by-variable contributions to distance even
computationally demanding (but it can be made bearable)
\(\pi_{nn}\) tuning and regularization of \(\delta_{int}\)’s to reduce variability
main references
Le, S. Q. and T. B. Ho (2005a). “An association-based dissimilarity measure for categorical data”. In: Pattern Recognition Letters 26.16, pp. 2549-2557.
Mbuga, F. and C. Tortora (2021a). “Spectral Clustering of Mixed-Type Data”. In: Stats 5.1, pp. 1-11.
Mousavi, E. and M. Sehhati (2023a). “A Generalized Multi-Aspect Distance Metric for Mixed-Type Data Clustering”. In: Pattern Recognition, p. 109353.
Murugesan, N., I. Cho, and C. Tortora (2021). “Benchmarking in cluster analysis: a study on spectral clustering, DBSCAN, and K-Means”. In: Conference of the International Federation of Classification Societies. Springer, pp. 175-185.
Velden, M. van de, A. Iodice D’Enza, A. Markos, et al. (2024). “A general framework for implementing distances for categorical variables”. In: Pattern Recognition 153, p. 110547.
Velden, M. van de, A. Iodice D’Enza, A. Markos, et al. (2025a). “A general framework for unbiased mixed-variables distances”. Under second-round review at: Journal of Computational and Graphical Statistics.